Computer Science > Machine Learning

arXiv:2303.02506 (cs)
[Submitted on 4 Mar 2023 (v1), last revised 18 Jan 2024 (this version, v3)]

Title: Prismer: A Vision-Language Model with Multi-Task Experts

Authors: Shikun Liu, Linxi Fan, Edward Johns, Zhiding Yu, Chaowei Xiao, Anima Anandkumar
Abstract: Recent vision-language models have shown impressive multi-modal generation capabilities. However, they typically require training huge models on massive datasets. As a more scalable alternative, we introduce Prismer, a data- and parameter-efficient vision-language model that leverages an ensemble of task-specific experts. Prismer requires training only a small number of components: the majority of network weights are inherited from multiple readily available, pre-trained experts and kept frozen during training. By leveraging experts from a wide range of domains, we show that Prismer can efficiently pool this expert knowledge and adapt it to various vision-language reasoning tasks. In our experiments, we show that Prismer achieves fine-tuned and few-shot learning performance competitive with the current state of the art, whilst requiring up to two orders of magnitude less training data. Code is available at this https URL.
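
As the abstract describes, Prismer trains only a small number of components while inheriting most weights from frozen, pre-trained experts. The sketch below illustrates that general recipe in PyTorch; it is not the paper's actual architecture, and the class name, parameters, and toy linear "experts" are illustrative stand-ins for full pre-trained networks:

```python
import torch
import torch.nn as nn

class FrozenExpertEnsemble(nn.Module):
    """Illustrative wrapper: a set of pre-trained task experts kept
    frozen, plus one small trainable adapter that fuses their outputs."""

    def __init__(self, experts: list[nn.Module], feat_dim: int, out_dim: int):
        super().__init__()
        self.experts = nn.ModuleList(experts)
        # Freeze every expert: inherited weights are never updated.
        for expert in self.experts:
            expert.eval()
            for p in expert.parameters():
                p.requires_grad = False
        # The only trainable component here: a lightweight projection
        # over the concatenated expert features.
        self.adapter = nn.Linear(feat_dim * len(experts), out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():  # experts run in inference mode
            feats = [expert(x) for expert in self.experts]
        return self.adapter(torch.cat(feats, dim=-1))

# Usage: two stand-in "experts"; only the adapter reaches the optimizer.
experts = [nn.Linear(32, 64), nn.Linear(32, 64)]
model = FrozenExpertEnsemble(experts, feat_dim=64, out_dim=128)
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)
out = model(torch.randn(4, 32))  # shape: (4, 128)
```

Because only the adapter's parameters require gradients, the optimizer updates a small fraction of the total weights, which is the source of the data and parameter efficiency the abstract claims.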
Comments: Published at TMLR 2024. Project Page: this https URL. Code: this https URL.
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Cite as: arXiv:2303.02506 [cs.LG]
  (or arXiv:2303.02506v3 [cs.LG] for this version)
  https://doi.org/10.48550/arXiv.2303.02506
arXiv-issued DOI via DataCite
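
For convenience, a BibTeX entry can be assembled by hand from the metadata on this page. The citation key, journal field, and formatting below are our own choices rather than the official arXiv/DataCite export:

```bibtex
@article{liu2023prismer,
  title         = {Prismer: A Vision-Language Model with Multi-Task Experts},
  author        = {Liu, Shikun and Fan, Linxi and Johns, Edward and Yu, Zhiding
                   and Xiao, Chaowei and Anandkumar, Anima},
  journal       = {Transactions on Machine Learning Research (TMLR)},
  year          = {2024},
  eprint        = {2303.02506},
  archivePrefix = {arXiv},
  primaryClass  = {cs.LG}
}
```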

Submission history

From: Shikun Liu
[v1] Sat, 4 Mar 2023 21:22:47 UTC (12,817 KB)
[v2] Sun, 12 Mar 2023 02:30:16 UTC (12,817 KB)
[v3] Thu, 18 Jan 2024 22:09:40 UTC (12,820 KB)